I have chosen the Red Wine Quality dataset. You can download the data using this link.
This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
fixed acidity (tartaric acid - g / dm^3)
volatile acidity (acetic acid - g / dm^3)
citric acid (g / dm^3)
residual sugar (g / dm^3)
chlorides (sodium chloride - g / dm^3)
free sulfur dioxide (mg / dm^3)
total sulfur dioxide (mg / dm^3)
density (g / cm^3)
pH
sulphates (potassium sulphate - g / dm^3)
alcohol (% by volume)
# Loading all the required packages
library(ggplot2)
library(grid)
library(gridExtra)
library(GGally)
library(dplyr)
library(tidyr)
library(reshape)
library(memisc)
#Loading dataset
wine<-read.csv("wineQualityReds.csv")
names(wine)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
str(wine)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Variable X is used for indexing the dataset. Let’s look at a general summary of the data.
summary(wine[2:13])
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Let’s look at the data.
head(wine)
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
Let’s plot the distribution of each variable. Since quality takes only a few integer values, let’s treat it as a factor and plot a bar chart.
I have also added a new variable to the data, wine_quality, which labels each wine as low (quality below 5), average (5 or 6) or high (7 or above).
wine$wine_quality<- ifelse(wine$quality < 5, "low",
ifelse(wine$quality < 7, "average", "high"))
wine$wine_quality <- factor(wine$wine_quality,
levels=c("high", "average", "low"), ordered=TRUE)
attach(wine)
ggplot(aes(x=factor(quality)),data=wine)+geom_bar(color='black',fill='blue')
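The low/average/high bucketing above could also be written with cut() instead of nested ifelse() calls. This is only a sketch on a toy vector of ratings, not the wine data itself, and it orders the levels low < average < high, which differs from the level order used in the code above.

```r
# Toy ratings only; the real analysis would pass wine$quality instead.
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are (-Inf, 4] -> low, (4, 6] -> average, (6, Inf] -> high,
# which matches the ifelse() thresholds (< 5, < 7) for integer ratings.
wine_quality <- cut(quality,
                    breaks = c(-Inf, 4, 6, Inf),
                    labels = c("low", "average", "high"),
                    ordered_result = TRUE)

print(data.frame(quality, wine_quality))
```

With an ordered factor built this way, comparisons such as wine_quality >= "average" follow the intuitive direction.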
ggplot(aes(x=fixed.acidity),data=wine)+ geom_histogram(binwidth = 0.1,fill='orange') + theme_bw()+scale_x_continuous(breaks = seq(0,16,2))
The distribution of fixed acidity is right skewed. Let’s apply a log transformation to the x axis and see whether that reduces the skew.
ggplot(aes(x=fixed.acidity),data=wine)+ geom_histogram(binwidth = 0.01,fill='orange') + theme_bw()+ coord_cartesian()+scale_x_log10(breaks=seq(0,16,2))
This appears more reasonable now, with most values concentrated between 7 and 9 g / dm^3 of fixed.acidity.
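To put a number on how much a log transformation helps, one option is to compare the sample skewness before and after. The sketch below uses simulated right-skewed data rather than the wine measurements, and skewness() is a small helper defined here, not a base R function.

```r
set.seed(42)  # simulated data, not the wine measurements
x <- rlnorm(1000, meanlog = 2, sdlog = 0.4)

# Sample skewness: third central moment divided by the second central
# moment raised to the power 1.5.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
cat("skewness before log10:", round(skew_raw, 2), "\n")
cat("skewness after  log10:", round(skew_log, 2), "\n")
```

A value near 0 after the transformation suggests a roughly symmetric distribution. The same helper could be applied to wine$fixed.acidity.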
Now let’s have a look at volatile.acidity.
ggplot(aes(x=volatile.acidity),data=wine)+ geom_histogram(binwidth = 0.01,fill='cyan') + theme_bw()+ coord_cartesian()+scale_x_log10(breaks=seq(0,1.6,0.2))
It is unclear whether the distribution of volatile acidity is bimodal or unimodal, right skewed or normal.
ggplot(aes(x=citric.acid),data=wine)+ geom_histogram(binwidth = 0.01,fill='blue') + theme_bw()
ggplot(aes(x=citric.acid),data=wine)+ geom_histogram(binwidth = 0.1,fill='blue') + theme_bw()+ coord_cartesian()+scale_x_log10(breaks=seq(0,1,0.2))
summary(wine$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of this variable is not clear. In the first plot it appears to be bimodal, with peaks at 0 and 0.5, and the log transformation (second plot) doesn’t help either, partly because wines with a citric acid value of 0 cannot be shown on a log scale. Let’s take a look at its box plot.
ggplot(aes(x="citric.acid",y=citric.acid),data=wine)+ geom_boxplot() + theme_bw()
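The whiskers of a boxplot extend to the most extreme points within 1.5 times the interquartile range of the quartiles. As a quick check, those limits can be computed by hand from the quartiles in the summary(wine$citric.acid) output above. The numbers are copied from that output, so they are rounded.

```r
# Quartiles and maximum copied from the summary(wine$citric.acid) output.
q1 <- 0.090
q3 <- 0.420
max_value <- 1.000

iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

cat("IQR:", iqr, "\n")
cat("fences:", lower_fence, "to", upper_fence, "\n")
cat("maximum beyond upper fence:", max_value > upper_fence, "\n")
```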
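The boxplot shows the median value to be just above 0.25, and nearly every point lies within 1.5 times the interquartile range of the quartiles; the maximum of 1.00, however, falls above the upper limit of 0.915 (0.42 + 1.5 × 0.33). Now let’s look at the residual sugar histogram.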
ggplot(aes(x=residual.sugar),data=wine)+ geom_histogram(binwidth=0.1) + theme_bw()
This distribution is also right skewed. Let’s remove the top 5% of the data, apply a log transformation and see whether that helps.
df<-subset(wine,residual.sugar<quantile(residual.sugar,probs=c(0.95)))
ggplot(aes(x=residual.sugar),data=df)+ geom_histogram(binwidth = 0.05) + theme_bw()+ coord_cartesian()+scale_x_log10(breaks= seq(0,10,1))
Now this distribution looks much closer to normal. Similarly, let’s look at the distribution of the chlorides variable.
df<-subset(wine,chlorides<quantile(chlorides,probs=c(0.95)))
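# xlim() would replace scale_x_log10() and drop the log scale, so it is omitted here;
# df is already trimmed to the values below the 95th percentile of chlorides.
ggplot(aes(x=chlorides),data=df)+ geom_histogram(binwidth = 0.01) + theme_bw()+ scale_x_log10()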
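The trimming step used for residual sugar and chlorides can be tried in isolation. This sketch runs on simulated right-skewed values, and the column name x is a placeholder. It shows that subset() with a strict "<" against the 95th percentile keeps roughly the bottom 95% of rows.

```r
set.seed(1)  # simulated right-skewed values standing in for a wine variable
values <- data.frame(x = rlnorm(500, meanlog = 0, sdlog = 0.6))

# 95th percentile, then keep only rows strictly below it.
cutoff  <- unname(quantile(values$x, probs = 0.95))
trimmed <- subset(values, x < cutoff)

cat("rows before:", nrow(values), "\n")
cat("rows after: ", nrow(trimmed), "\n")
cat("largest kept value is below the cutoff:", max(trimmed$x) < cutoff, "\n")
```

Because the filter removes the upper tail before plotting, any statement about the shape of the trimmed histogram applies only to the retained wines.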
The alcohol content can be another important consideration when we are purchasing wine:
ggplot(wine,aes(x=alcohol)) + geom_density(color='black')
summary(wine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol content of the wines in the dataset appears right skewed, roughly lognormal, with a high peak at the lower end of the alcohol scale.
Let’s have a look at pH levels.
summary(wine$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
There are 1,599 red wines in the dataset with 11 features on the chemical properties of the wine (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol), plus the quality rating.
Other observations:
The median quality is 6. About three quarters of the wines have a pH of 3.21 or higher. About 75% of the wines have a quality of 6 or lower. The median alcohol content is 10.20% and the maximum is 14.90%.
I found that citric acid has an unusual distribution in the dataset. Since the data was already tidy, I did not modify it beyond adding the wine_quality variable.
We can quickly visualize the relationship between each pair of variables and find their Pearson product-moment correlation.
ggscatmat(wine,columns = 2:13) + theme_minimal(base_size = 7)
From the plot, we can see that the three variables most positively correlated with quality are alcohol, sulphates and citric.acid.
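Reading a ranking off a scatterplot matrix is error-prone, so it can help to compute it directly. The sketch below uses simulated data with placeholder column names (a, b, c, target). With the wine data the same pattern would use the chemical variables as predictors and quality as the target.

```r
set.seed(7)  # simulated data; column names are placeholders
n <- 300
sim <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
sim$target <- 1.5 * sim$a - 0.5 * sim$b + rnorm(n)

# Pearson correlation of every predictor with the target column.
predictors <- setdiff(names(sim), "target")
r <- sapply(sim[predictors], function(col) cor(col, sim$target))

# Sort by absolute value so strong negative correlations are not overlooked.
ranked <- r[order(abs(r), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting on the absolute value keeps a strongly negative predictor near the top of the list instead of at the bottom.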
The variables most negatively correlated with quality are volatile.acidity, total.sulfur.dioxide and density. This seems reasonable, since most of the acids in wine are fixed acids. Let’s look at a few of these relationships in a bit more detail.
ggplot(aes(x=density,y=alcohol),data=wine)+
geom_point()
This plot doesn’t really tell us much about the trend. Let’s add jitter and transparency and fit a linear model to see what’s going on.
ggplot(aes(x=density,y=alcohol),data=wine)+
geom_jitter(alpha=0.2)+
stat_smooth(method = "lm",formula = y~x)
We see that density tends to increase with decreasing alcohol content. Let’s look at the correlation between the two and check if it’s true.
cor.test(wine$density,wine$alcohol)
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
The correlation of about -0.50 is moderate and negative, which agrees with the plot.
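The t statistic and the confidence interval in the cor.test() output can be reproduced from the correlation and the degrees of freedom alone. The sketch below copies r and df from the output above and uses the standard formulas: the t statistic for a zero-correlation test, and the Fisher z-transformation for the interval.

```r
# Values copied from the cor.test() output.
r  <- -0.4961798
df <- 1597          # n - 2, so n = 1599
n  <- df + 2

# t statistic for testing a zero correlation.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

cat("t =", round(t_stat, 3), "\n")
cat("95% CI:", round(ci, 4), "\n")
```

With 1,599 observations even a moderate correlation gives a very large t statistic, so the tiny p-value says more about the sample size than about the strength of the relationship.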
ggplot(wine,aes(x=alcohol,fill=factor(quality)))+
geom_density(alpha=0.2)
It looks like the red wines with a higher alcohol content tend to have a higher quality rating…what a surprise!
by(wine$alcohol, factor(wine$quality), summary)
## factor(wine$quality): 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## factor(wine$quality): 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## factor(wine$quality): 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## factor(wine$quality): 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## factor(wine$quality): 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## factor(wine$quality): 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
The summaries support this: wines rated 7 and 8 have median alcohol contents of 11.50% and 12.15%, against 10.50% or less for the lower ratings.
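To see how steadily alcohol rises with the rating, the medians from the by() output can be compared step by step. The values below are copied from that output; the sketch only looks at the change from one rating to the next.

```r
# Median alcohol per quality rating, copied from the by() output.
median_alcohol <- c("3" = 9.925, "4" = 10.00, "5" = 9.70,
                    "6" = 10.50, "7" = 11.50, "8" = 12.15)

# Change in the median from one rating to the next
# (each entry is named after the higher rating of the pair).
step_change <- diff(median_alcohol)
print(step_change)

# Ratings whose median is lower than that of the rating below them.
print(names(step_change)[step_change < 0])
```

The increase is not strictly monotonic: the median dips at rating 5 before rising through ratings 6, 7 and 8.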
ggplot(wine,aes(y=volatile.acidity,x=quality))+
geom_jitter(alpha=0.3) +
geom_smooth(method = "lm")
The graph shows a clear downward trend: the lower the volatile acidity, the higher the quality tends to be. The correlation coefficient between quality and volatile acidity is -0.39, a moderate negative relationship. This is plausible because volatile acidity at high levels can lead to an unpleasant, vinegar-like taste.
ggplot(wine,aes(y=sulphates,x=quality))+
geom_jitter(alpha=0.3) +
geom_smooth(method = "lm")
This is a weak positive relationship, but still: the higher the sulphates, the higher the quality tends to be.
I observed a negative relationship between quality level and volatile acidity, and a positive correlation between quality level and alcohol. I am not surprised by this result, because tasters may tend to grade stronger wines as high quality, whereas wines with a low percent alcohol are often not graded as such. High volatile acidity is also perceived as undesirable because it affects the taste of wines. Alcohol and volatile acidity don’t show any clear relationship with each other.
I also observed a positive relationship between density and fixed acidity, a positive relationship between fixed acidity and citric acid, and a negative relationship between pH and fixed acidity. The other variables show either a very weak relationship or none at all.
With quality, alcohol is positively related whereas volatile.acidity is negatively related. Among the other features, density is positively related and pH negatively related to fixed acidity; the remaining features of interest show weak relationships.
Now let’s visualise the relationship between volatile.acidity,alcohol and quality.
ggplot(aes(x=volatile.acidity,y=alcohol,color=factor(quality)),data=wine)+
geom_point()+
scale_color_brewer()+
labs(color="Quality level")+
xlab("Volatile acidity")+
ylab("alcohol level")
The plot shows that higher quality wines are concentrated in the top left corner, that is, at lower volatile.acidity and higher alcohol, which agrees with the analysis above.
Now let’s analyze sulphate levels and alcohol with respect to quality.
ggplot(aes(x=sulphates,y=alcohol,color=factor(quality)),data=wine)+
geom_density2d(bins=2)+
scale_color_brewer()+
geom_point(color='black',alpha=0.1)
This shows that higher quality red wines are generally located near the upper right of the scatter plot (darker contour lines), whereas lower quality red wines are generally located in the bottom right.
ggplot(aes(x=sulphates,y=alcohol,color=factor(quality),size=volatile.acidity),data=wine)+
geom_point()
Let’s visualise wine_quality variable created with other factors.
ggplot(aes(y=volatile.acidity,x=density,color=wine_quality),data=wine)+
geom_point()+
scale_color_brewer()+
labs(color="Quality level")+
xlab("Density")+
ylab("Volatile acidity level")
The high quality wines are concentrated at densities between 0.994 and 0.998 and in the lower part of the volatile acidity range (y axis).
ggplot(aes(x=volatile.acidity,y=alcohol,color=wine_quality,size=citric.acid),data=wine)+
geom_jitter(alpha=0.1)+
xlab("Volatile acidity")+
ylab("Alcohol (%)")+
labs(color="Quality level",size="Citric acid level")+
ggtitle("Relationship between alcohol and volatile\n acidity w.r.t quality level and citric acid")
We can see that red dots are mostly concentrated in top left corner of the plot which signifies lower volatile acidity and higher alcohol.
ggplot(aes(x=fixed.acidity, y=volatile.acidity, color=wine_quality,size=pH),data=wine) +
geom_point()+
xlab("Fixed acidity ") +
ylab("Volatile acidity ") +
labs(color="Quality level",size="pH level")+
ggtitle("Relationship between fixed acidity and volatile\n acidity w.r.t quality level and pH level")
The low and average quality wines seem to be concentrated at fixed acidity values between 6 and 10. pH increases as fixed acidity decreases, and citric acid increases as fixed acidity increases.
ggplot(aes(x=fixed.acidity, y=alcohol, color=wine_quality,size=citric.acid),data=wine) +
geom_point()+
xlab("Fixed acidity ") +
ylab("Alcohol ") +
labs(color="Quality level",size="citric acid level")+
ggtitle("Relationship between fixed acidity and alcohol\n level w.r.t quality level and citric acid level")
ggplot(aes(x=residual.sugar,color=wine_quality),data=wine) +
geom_density()+
xlab("Residual sugar ") +
labs(color="Quality level")+
ggtitle("Relationship between residual sugar and quality level")
Now let’s generate a linear model based on above features.
m1<-lm(data=wine,quality~volatile.acidity)
m2<-update(m1,~.+alcohol)
mtable(m1,m2,sdigits=3)
##
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = wine)
## m2: lm(formula = quality ~ volatile.acidity + alcohol, data = wine)
##
## ==========================================
## m1 m2
## ------------------------------------------
## (Intercept) 6.566*** 3.095***
## (0.058) (0.184)
## volatile.acidity -1.761*** -1.384***
## (0.104) (0.095)
## alcohol 0.314***
## (0.016)
## ------------------------------------------
## R-squared 0.153 0.317
## adj. R-squared 0.152 0.316
## sigma 0.744 0.668
## F 287.444 370.379
## p 0.000 0.000
## Log-likelihood -1794.312 -1621.814
## Deviance 883.198 711.796
## AIC 3594.624 3251.628
## BIC 3610.756 3273.136
## N 1599 1599
## ==========================================
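To make the table more concrete, the fitted coefficients of m2 can be used to predict the quality of a single wine by hand. The sketch below copies the rounded coefficients from the mtable() output and the first row from head(wine), so the result is approximate.

```r
# Coefficients copied from the m2 column of the mtable() output.
b0         <- 3.095   # intercept
b_volatile <- -1.384  # volatile.acidity
b_alcohol  <- 0.314   # alcohol

# First wine from head(wine): volatile.acidity 0.70, alcohol 9.4, quality 5.
predicted <- b0 + b_volatile * 0.70 + b_alcohol * 9.4
cat("predicted quality:", round(predicted, 2), "\n")
cat("residual (actual - predicted):", round(5 - predicted, 2), "\n")
```

With the fitted model object available, predict(m2, newdata = ...) would give the same kind of estimate without rounding the coefficients.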
When looking at wine quality level, we see a positive relationship between fixed acidity and citric acid.
Residual sugar, supposed to play an important part in wine taste, actually has very little impact on wine quality.
Yes, I created 2 models. Their R-squared values are 0.153 and 0.317, so even the better model explains less than a third of the variance in quality.
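Several rows of the model table are tied together by simple formulas, which makes it easy to sanity-check them. The sketch below recomputes the adjusted R-squared and the AIC from the R-squared, log-likelihood and N values copied from the table. Counting the error variance as one extra parameter in the AIC is the usual convention for lm() fits.

```r
n <- 1599  # number of wines (row N of the table)

# Adjusted R-squared from R-squared, with p predictors.
adj_r2 <- function(r2, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
cat("m1 adj. R-squared:", round(adj_r2(0.153, 1), 3), "\n")
cat("m2 adj. R-squared:", round(adj_r2(0.317, 2), 3), "\n")

# AIC = -2 * logLik + 2 * k, where k counts the coefficients
# (including the intercept) plus the error variance.
aic <- function(loglik, k) -2 * loglik + 2 * k
cat("m1 AIC:", aic(-1794.312, 3), "\n")
cat("m2 AIC:", aic(-1621.814, 4), "\n")
```

The lower AIC of m2 indicates that adding alcohol improves the fit by more than the penalty for the extra parameter.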
ggplot(aes(x=factor(quality),fill=wine_quality),data=wine)+geom_bar()+
xlab("Quality level") +
ylab("Count of wines")+
labs(fill="Quality level")+
ggtitle("Number of wines w.r.t quality")
Most wines are rated 5 or 6. Although the rating scale runs from 0 to 10, no wine is rated below 3 or above 8.
ggplot(wine,aes(x=alcohol,fill=factor(quality)))+
geom_density(alpha=0.2)+
xlab("Alcohol level")+
labs(fill="Quality level")
I observed a positive correlation between quality level and alcohol. Tasters may tend to grade stronger wines as high quality, whereas wines with a low percent alcohol are often not graded as such. Alcohol is the main carrier of aroma and bouquet, and hence of the flavours of wine. The plot is consistent with this: the higher the alcohol level, the higher the quality level of the wine tends to be.
ggplot(aes(x=volatile.acidity,y=alcohol,color=wine_quality),data=wine)+
geom_point()+
scale_color_grey()+
labs(color="Quality level")+
xlab("Volatile acidity")+
ylab("alcohol level")+
theme_bw()
Alcohol and volatile acidity act on quality in opposite directions. Wine with a high percent alcohol content and low volatile acidity tends to be rated as high quality wine. Based on this result, volatile acidity and percent alcohol content appear to be two important components in the quality and taste of red wines.
The wine data set contains information on 1,599 wines across twelve variables from around 2009. Although there are fewer plots in this submission, I did a lot of visualisation and kept the plots I deemed useful. I had to go through each variable in the dataset, and yes, it is tedious, but it was fun making this notebook. There was a trend between the volatile acidity of a wine and its quality, and also between its alcohol content and its quality. No wines were rated below 3 or above 8, so the extremes of the scale are not represented. We could improve the analysis by collecting more data and creating more variables that may contribute to the quality of wine, which should help the accuracy of the prediction models. Having said that, we have identified features that are associated with the quality of red wine, visualized their relationships and summarized their statistics.